ENVST 206: Environmental Data Science, Hamilton College

Instructions

There are 10 questions to this activity. Save your answers in word document that you will hand in on Blackboard using a .pdf or .docx extension. Keep your script file in your GitHub folder and make sure that all changes are pushed to GitHub. Code for all questions should be clearly commented with the question number. You will include a link to your script file as a part of your final question in this activity. The assignment is worth 30 points.

Learning objectives

1. Learn about data describing species populations and impacts on ecosystems

2. Learn how to conduct statistical hypothesis testing

3. Interpret statistical output in R

t-tests

The consumption of vegetation by herbivores has been well established to impact the carbon cycling through impacts on reductions on plant biomass. These impacts can even change the energy, water, and nutrient cycling in an area.

Brown lemmings are a type of rodent that lives in the Arctic in Alaska and eastern Siberia.

An example of a lemming. CC3:Frode Inge Helland

Lemming populations fluctuate over time. When lemming populations reach high numbers in an ecosystem, they have been observed to consume up to 90% of the plant biomass in an area. However, lemming populations in the Arctic have been decline over the previous decades and some regions have seen lemmings disappear. A study done by Lara et al. in 2017 focused on better understanding the impact of lemming herbivory on the carbon cycle in Arctic ecosystems. This research can help indicate the potential impacts that declines in lemming herbivory might have on changing Arctic ecosystems.

Lara et al. looked at the impact of lemmings by using fenced exclosure that kept lemmings out (pictured below). The authors found that there were sufficient high lemming population outbreaks in frequent intervals to properly capture long term lemming impacts on the ecosystem.

They measured many components of the carbon cycle including how much carbon dioxide was released from the soil versus taken in by plants between areas with exclosures and without exclosures. Methane is another greenhouse gas that can be released by bacteria in Arctic soils typically when soils are flooded or hold a high volume of water. The vegetation at the soil surface can influence the release of methane from soils. Methane emissions from Arctic soils are a major research focus of Arctic climate change research since changes in methane production in Arctic soils may amplify climate change. We will focus on analyzing the impacts of lemming grazing of vegetation on methane fluxes. Let’s read in the data.

#make sure you change the file path to match your computer
ch4 <- read.csv("/Users/hkropp/Google Drive/teaching/2020/Fall 2020/EnvDataSci/activity/data/activity 3/lemming_herbivory.csv")

Here, you will notice that there are two variables in this data, there are CH4 fluxes (mgC m–2 day–1) and there is an herbivory column where a value of Ctl indicates that it was a control plot with no exclosure that was was open to lemming grazing. The Ex rows will indicate that an exclosure was applied preventing lemmings from accessing the site. It will be helpful to treat the herbivory column as a factor:

ch4$herbivory <- as.factor(ch4$herbivory)

Let’s start by taking a look at the data. Let’s make a boxplot to get a basic idea of the data. Here positve values indicate that methane is being emitted from the plot and a negative value indicates that there is methane uptake occurring over the plot surface.

plot(ch4$CH4_Flux ~ ch4$herbivory, xlab ="Treatment", 
     ylab="CH4 fluxes (mgC m –2 day–1) ")

The boxplot shows breaks the data up into quartiles (the values where we observe 0,25,50,75,and 100% of the data). Here the thick black lines show the median of the data. This is the value where 50% of the data observations are above the value and below the value. The grey shaded box shows the range between 25-75%. The whiskers or lines show the rest of the range of data. Note that this function automatically seperates potential outliers and shows them as open points. However, this is based on a calculation and these observations may be legitimate observations to include in our analysis. In this case, the observation is well within an observable and reasonable range, it just does not occur frequently in the exclusion plots. We have no reason to believe this data should be thrown out and will keep it in our analysis.

We can see here that methane fluxes are higher in the control plots were lemmings can freely graze. Let’s examine if there is a statistical difference between the grazed and the exclosure plots. First, let’s make sure that the data meet the assumptions of a t-test. You will want to check that each grazing treatment is normally distributed. The Shapiro-Wilk test can be used to test normality more quantitatively than just looking at the histogram of data. In R, the function for the Shapiro Wilk test is shapiro.test and the only argument is the vector of data that you want to test.

#use the shapiro wilks test to look for normality in each treatment
#shapiro-wilk test on grazing plots
shapiro.test(ch4$CH4_Flux[ch4$herbivory == "Ctl"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ch4$CH4_Flux[ch4$herbivory == "Ctl"]
## W = 0.87763, p-value = 0.08173

The shapiro-wilk test is set up as a statistical test. This means that you are testing the null hypothesis:

H0: the data are normally distributed.

HA: the data are not normally distributed.

You’ll use the conventional confidence level (\(\alpha\) = 0.05) for our statistics in this class. In this test, we’ll rely on the p-value associated with the test statistic to assess how this outcome compares to our confidence level. A p-value of less than 0.05 would indicate that we should reject our null hypothesis since it would be unlikely to observe this test statistic value if the data were normally distributed. If the p-value is greater than 0.05, then we can assume that our data is normally distributed. You can see that the control plots do not deviate from a normal distribution. Next let’s test the exclusion plots:

#shapiro-wilk test on grazing exclusion plots
shapiro.test(ch4$CH4_Flux[ch4$herbivory == "Ex"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ch4$CH4_Flux[ch4$herbivory == "Ex"]
## W = 0.93325, p-value = 0.4158

Data in both groups is normally distributed. Before running the t-test, we can look at whether the variances are equal. You will use the Bartlett test in R. This is another statistical test in which you evaluate whether to reject or accept a statistical null hypothesis:

H0: Groups have similar variances

HA: Groups do not have similar variances

In R, you use the bartlett.test function to run the test statistic and to evaluate whether the p-value is below the confidence level. A p-value of less than 0.05 would indicate we would be unlikely to observe the difference in variances under the null hypothesis.

#use bartlett test since testing for equal variance
bartlett.test(ch4$CH4_Flux ~ ch4$herbivory)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  ch4$CH4_Flux by ch4$herbivory
## Bartlett's K-squared = 0.21236, df = 1, p-value = 0.6449

Here, the test statistic is well above 0.05 so you can assume the variances are equal.

You can now conduct a t-test now that you know the data meets all of the assumptions needed to reliably apply the test. In R, the t.test function performs a t-test. Keep in mind this is a two sample t-test. Our data is formatted so that the identifier of the treatments is in the herbivory column and the data for both groups is in the CH4_Flux column. The easiest way to tell R how to use this data is using a formula expression. Typically formulas follow this expression: Dependent variable ~ Independent variables. That symbol is called a tilde and the key for it is usually just above your tab key and next to the 1 key on your keyboard. Whenever, you see an expression you can think of it as dependent variable is a function of the independent variables. This is a common way to express statistical relationships and we will build on it with more complex relationships. Here we’ll use this formula notation to indicate that the t test for flux data should be a function of the groups indicated by the herbivory column.

t.test(ch4$CH4_Flux ~ ch4$herbivory)
## 
##  Welch Two Sample t-test
## 
## data:  ch4$CH4_Flux by ch4$herbivory
## t = 1.5328, df = 21.569, p-value = 0.1399
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.644609 30.844370
## sample estimates:
## mean in group Ctl  mean in group Ex 
##         18.814645          5.714765

Analysis of Variance (ANOVA)

Next, we’ll examine the impact of the urban environment on insect populations. Urban built environments are often warmer environments with less water availability. Residential areas of urban environments can provide more vegetated environments, but plant cover is often not native and mostly dominated by a single plant cover such as grass. However, the urban environment can affect survival and reproduction of some insect species. Other insects may be well equipped to survive in many urban environments and there is a concern that urbanization can drive declines in the diversity of insects. This may negatively impact essential part of ecosystems driving pollination, the food web, and the decomposition of plant material.

To evaluate the impact of the urban environment on insects, Adams et al. (2019) trapped insects across the Los Angeles area and examined the number of species at each location. The term species richness is used to describe how many different species can be found in a given area. It is often a useful metric that can contribute to our understanding of species diversity in an area.

#read in insect data
datI <- read.csv("/Users/hkropp/Google Drive/teaching/2020/Fall 2020/EnvDataSci/activity/data/activity 3/insect_richness.csv")

As you learned in class, ANOVA is a widely used technique for testing for differences in three or more groups. Here you will look at the Richness and the urbanName column. urbanName gives the type of urban environment the species richness was measured in. You will want to specify urbanName as a factor. The urbanType column provides an identifying number originally used in by the authors. We are only focusing on a few urban areas in this analysis. Below is a table that provides a more detailed description of each area used in the study site. The numbers in the table coincide with the urbanType column.

It will be helpful to convert the names to factors in R.

datI$urbanName <- as.factor(datI$urbanName)

In ANOVA, we use the F statistic to evaluate differences in data between groups. Here the F-ratio that we use for our statistical evaluation is calculated based on the variance of the data around the mean. First the among group (SSamong, sometimes referred to as between) sum of squares calculates the squared difference between each group mean and the overall mean of all the data and sums up the differences for all groups. The within group (SSwithin, also referred to as residual) sum of squares calculates the squared difference between each data observation and its group mean and sums these differences for all observations. The mean squares is then calculated using the number of groups (SSamong) and both the number of groups and observations (SSwithin). This table also shows the total SS for good measure, but we don’t often look at it since SStotal = SSamong + SSwithin. The F-ratio is calculated by dividing the mean square for among groups by within groups. The p-value provides a probability for observing the calculated F-ratio and all values greater under the null distribution. You will want to use the standard confidence level of 0.05 to make your assessment.

Image credit: Gotelli & Ellison, An Introduction to Ecological Statistics
Before moving on, I want to explore the meaning of this F-ratio calculation a little more. Let’s take a look at some example data shown below. Think about the amount of variability between groups relative to the variability within groups. For these example plots, I’ve jittered the actual data points around the each group area so you can see the data points throughout the y-axis but they randomly appear wihtin each box plot so that there is minimal overlap. The overall mean for all the data is shown as a blue line and the means for each group are shown as red points. The quartiles for the data are shown as boxplots.

Next you will run the ANOVA test on the insect data to examine whether the different urban environments impact species richness. Let’s take a look at the insect data in a similar format to the plots above.

Now that you’ve had a chance to check the assumptions and visualize the data, you are ready to run the ANOVA in R. There are two steps to this test in R. You first specify a linear model with the lm function. The next step is to run the ANOVA calcluations using aov. Both of these functions will simply run the calculations. To view the ANOVA table of results, you will print the table using the summary function. In the lm function we will specify a formula similar to the t-test where your Richness ~ Urban.

#specify model for species richness and urban type
in.mod <- lm(datI$Richness ~ datI$urbanName)
#run the ANOVA
in.aov <- aov(in.mod)
#print out ANOVA table
summary(in.aov)
##                 Df Sum Sq Mean Sq F value  Pr(>F)   
## datI$urbanName   3   1944   647.9   4.898 0.00254 **
## Residuals      236  31216   132.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the table will print out such that the upper row labeled with our group vector name (datI$urbanName) is our among group variability (sum of squares for group means compared to overall mean) and the residuals row is the within group variability. You can see the F value in the upper right part of the table and the associated p-value.

The ANOVA results alone do not give us a complete picture of the differences between urban landcover types. Some groups actually look quite similar. We must check to see how the means compare between groups to make a full statistical conclusion. You would want to highlight that only some groups may differ from each other when drawing conclusions about the urban environment impacts on species richness. This pairwise comparison between groups is called a posthoc test. We’ll use Tukey’s HSD (Honestly Significant Difference) test to conduct this comparison.There is a simple built in function in R. You just have to give the name of the variable that has the ANOVA results.

#run Tukey HSD
tukeyT <- TukeyHSD(in.aov)
#view results
tukeyT
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = in.mod)
## 
## $`datI$urbanName`
##                         diff         lwr        upr     p adj
## Developed-Dense     1.433333  -3.5966663  6.4633329 0.8819569
## Natural-Dense      12.583333   3.3998525 21.7668141 0.0026479
## Suburban-Dense      3.785714  -0.8060261  8.3774547 0.1456297
## Natural-Developed  11.150000   1.7397324 20.5602676 0.0128693
## Suburban-Developed  2.352381  -2.6776186  7.3823805 0.6210733
## Suburban-Natural   -8.797619 -17.9810999  0.3858618 0.0658910

Here you will notice that there are two group names in each row indicating which groups are being compared. A significant difference between groups means that the confidence interval in the difference between the means will not overlap with zero and will have a p-value below our confidence level threshold of 0.05. Note that this test prints a p-adj value. This means it has made some adjustments to account for the fact that were a comparing many means and should account for this in our calculations.

You can also use the plot function and input the entire test variable to generate a plot that shows the confidence levels for the mean comparisons. This can help you get a better visualization for how close these intervals are to zero since there are a lot of numbers printed out in the table.

#make a plot
#make axes labels smaller than usual to fit on plot using cex.axis 
plot(tukeyT, cex.axis=0.75)

Typically the plots showing the means and measures of variance use letters to show the when groups are similar. For example, an ANOVA with four groups that all significantly differed from each other would show a different letter for each group like a,b,c,d. Here’s our panel a example, where all four groups are different.

If the first two groups were statistically similar, but differed from the last two groups, then the numbers would be assigned: a,a,b,c. Work through how you would designate letters demonstrating which groups are similar. Note that a group can be assigned two letters at the same time. An example would be group 1 is similar to group 2 and nothing else, both group 1 and group 2 get an a. However if group 2 is also similar to group 3 in addition to group 1, group 2 and group 3 would get a label of b. So group two would get the label a,b.

There is a function for calculating means across factor data simultaneously in R. This is helpful for cases where you have only a handful of groups (aggregate from last week when you are calculating many means such as over 10).

tapply(datI$Richness, datI$urbanName, "mean")
##     Dense Developed   Natural  Suburban 
##  19.83333  21.26667  32.41667  23.61905

Now you can fully assess the differences in insect species richness between the different urban landcover types. Comparing which groups have higher and lower insect species richness helps us interpret the patterns in the data more broadly in addition to the statistical test.

Chi-squared goodness of fit test

The next test is useful if you want to collect categorical data (also known as nominal data). Categorical data fits into distinct predefined categories. For example, if you counted the number of blue and red cars driving down College Hill Road in an hour, you have data in two distinct categories: # of red cars and # of blue cars.

You will look at data collected by Farnsworth in 2004. Farnsworth looked at the influence of legal protections on populations of rare plant species. Here we have two variables each with two outcomes. The first describes legal protection with outcomes being protected or not protected. The second is the status of population: stable/increasing or decreasing. Here a decreasing observation for a species would indicate that the number of plants in that species is declining over time. You will set up our data in a contigency table. Rows and columns are assigned the outcomes for the two different variables and each cell of the table shows the number of observations in those categories. Each observation describes the number of species

#set up contigency table
species <- matrix(c(18,8,15,32), ncol=2, byrow = TRUE) 
colnames(species) <- c("Not protected", "Protected")
rownames(species) <- c("Declining", "Stable/Increase")

You can use a mosaic plot to visualize the proportion of the total observations in each category. You will see the size of the boxes is weighted by the relative frequency of species within each category.

#make a mosaic plot with an informative title and axes labels
mosaicplot(species, xlab="population status", ylab="legal protection",
           main="Legal protection impacts on populations")

Visually it looks like there may be some difference in population status that depends on legal protections. However, it is difficult to know whether this difference is significant. You can test whether the frequency of observations in each category meets the expectation under a null hypothesis. Here the statistical hypotheses are:

H0: Observations in each category occur at expected frequencies

HA: Observations in each category differ from expected frequencies

The test statistic is calculated by comparing the observations to expectations:
\(\chi\)2 = \(\sum (O-E)^2/E\)

This statistic uses a \(\chi\)2 distribution to evaluate how likely we are to observe deviations from the expected value under the null hypothesis. Higher deviations from expected values result in a greater \(\chi\)2 statistic.

The \(\chi\)2 goodness of fit test in R uses the chisq.test function.

#Conduct a chi-squared test
chisq.test(species)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  species
## X-squared = 7.9642, df = 1, p-value = 0.004771

Citations

Lemming grazing & surface fluxes

Lara, M., D. Johnson, C. Andresen, R. Hollister, and C. Tweedie. (2017). Peak season carbon exchanges shifts from a sink to a source following 50+ years of herbivore exclusion in an Arctic tundra ecosystem. Journal of Ecology 105:122-131.

Data citation: Mark Lara. 2016. Long-term functional impact of Herbivore Exclusion. Arctic Data Center. doi:10.18739/A2CC0TT3H.

Insect Species Richness Adams, B., E. Li, C. Bahlai, E. Meineke, T. McGlynn, and B. Brown. (2020). Local- and landscape-scale variables shape insect diversity in an urban biodiversity hot spot. Ecological Applications. 30: e02089.

Data citation Adams, Benjamin et al. (2020), Local and landscape scale variables shape insect diversity in an urban biodiversity hotspot., v2, Dryad, Dataset, https://doi.org/10.5061/dryad.7d7wm37rd

Species protections

Farnsworth, E.J. (2004). Patterns of plant invasions at sites with rare plant species throughout New England. Rhodora: 97-117.